perf: Optimize count distinct using bitmaps instead of hashsets for smaller datatypes (#307)
martin-augment wants to merge 19 commits into main from optimize_count_distinct
Conversation
Walkthrough: This pull request adds code review guidelines for AI agents and introduces four new specialized accumulator implementations for COUNT(DISTINCT) on small integer types.
Code Review
The approach of using bitmaps/bool-arrays for small integer types in
Bug:
🤖 Augment PR Summary
Summary: Improves COUNT(DISTINCT) performance for small integer types by replacing HashSet-based tracking with compact bitset/array tracking.
}

fn size(&self) -> usize {
    size_of_val(self) + 256
In size(), size_of_val(self) already includes the inline [bool; 256] storage, so the extra + 256 double-counts memory. This can skew DataFusion’s memory accounting/limits for COUNT DISTINCT on small integer types.
Severity: medium
Other Locations
datafusion/functions-aggregate-common/src/aggregate/count_distinct/native.rs:331
value:useful; category:bug; feedback: The Augment AI reviewer is correct! The `seen` field is a fixed-size array stored inline in the struct, so it is already included in the value calculated by `size_of_val()` and should not be counted again manually. The fix prevents overcounting the accumulator's size.
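As a hedged illustration of the fix (the type and field names below are simplified stand-ins, not the PR's exact structs), an inline bool array needs no manual adjustment in `size()`:

```rust
use std::mem::size_of_val;

// Simplified stand-in for the PR's bool-array accumulator (names are illustrative).
struct BoolArrayAccumulator {
    seen: [bool; 256], // stored inline in the struct, not on the heap
}

impl BoolArrayAccumulator {
    fn size(&self) -> usize {
        // size_of_val(self) already includes the inline 256-byte array,
        // so no `+ 256` is needed.
        size_of_val(self)
    }
}
```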
),
// Small int types - cold path
DataType::UInt8 | DataType::Int8 | DataType::UInt16 | DataType::Int16 => {
    get_small_int_accumulator(data_type).unwrap()
This change routes UInt8/Int8/UInt16/Int16 through new bitmap/bool-array accumulators, but there don’t appear to be unit tests covering these new distinct paths (especially state()/merge_batch() behavior). Adding coverage here would help catch regressions in distinct correctness across partial/final aggregation.
Severity: medium
value:good-to-have; category:bug; feedback: The Augment AI reviewer is correct! Test cases for this new code path are not included in the Pull Request, so it may not behave as desired and could also regress in the future.
Code Review
This pull request introduces optimized COUNT DISTINCT accumulators for small integer types (u8, i8, u16, i16) using boolean arrays and bitmaps, and updates the count aggregate function to utilize these specialized paths. Additionally, it adds guidelines for AI agents to avoid reviewing specific configuration files or linking to external repositories. The review feedback suggests several performance optimizations, including bypassing Option iteration in update_batch when nulls are absent and using trailing_zeros() for faster bitmap processing, as well as correcting memory size calculations to avoid double-counting fixed-size arrays.
for value in arr.iter().flatten() {
    self.seen[value as usize] = true;
}
Iterating over the array using iter().flatten() is inefficient for primitive types because it performs an option check for every element. Since this is a performance-oriented optimization, it's better to check null_count() and use values() directly when no nulls are present.
if arr.null_count() == 0 {
for &value in arr.values() {
self.seen[value as usize] = true;
}
} else {
for value in arr.iter().flatten() {
self.seen[value as usize] = true;
}
}
value:good-to-have; category:bug; feedback: The Gemini AI reviewer is correct! The `flatten()` operation checks whether each item in the array is `Some` or `None`. For arrays without nulls this can be avoided with the constant-time check `arr.null_count() == 0`; that single check replaces N per-element `Option` checks, where N is the size of the array.
}

fn size(&self) -> usize {
    size_of_val(self) + 256
value:useful; category:bug; feedback: The Gemini AI reviewer is correct! The `seen` field is a fixed-size array stored inline in the struct, so it is already included in the value calculated by `size_of_val()` and should not be counted again manually. The fix prevents overcounting the accumulator's size.
for value in arr.iter().flatten() {
    self.seen[value as u8 as usize] = true;
}
Similar to the u8 accumulator, using iter().flatten() is less efficient than direct slice access when nulls are absent. Checking null_count() allows for a faster path using values().
if arr.null_count() == 0 {
for &value in arr.values() {
self.seen[value as u8 as usize] = true;
}
} else {
for value in arr.iter().flatten() {
self.seen[value as u8 as usize] = true;
}
}
value:good-to-have; category:bug; feedback: The Gemini AI reviewer is correct! The `flatten()` operation checks whether each item in the array is `Some` or `None`. For arrays without nulls this can be avoided with the constant-time check `arr.null_count() == 0`; that single check replaces N per-element `Option` checks, where N is the size of the array.
}

fn size(&self) -> usize {
    size_of_val(self) + 256
value:useful; category:bug; feedback: The Gemini AI reviewer is correct! The `seen` field is a fixed-size array stored inline in the struct, so it is already included in the value calculated by `size_of_val()` and should not be counted again manually. The fix prevents overcounting the accumulator's size.
for value in arr.iter().flatten() {
    self.set_bit(value);
}
value:good-to-have; category:bug; feedback: The Gemini AI reviewer is correct! The `flatten()` operation checks whether each item in the array is `Some` or `None`. For arrays without nulls this can be avoided with the constant-time check `arr.null_count() == 0`; that single check replaces N per-element `Option` checks, where N is the size of the array.
for (word_idx, &word) in self.bitmap.iter().enumerate() {
    if word != 0 {
        for bit in 0..64 {
            if (word & (1u64 << bit)) != 0 {
                values.push((word_idx as u16) * 64 + bit);
            }
        }
    }
}
Iterating through all 64 bits for every non-zero word is inefficient, especially for sparse bitmaps. Using trailing_zeros() to find set bits is significantly faster and more idiomatic for bitmap processing.
for (word_idx, &word) in self.bitmap.iter().enumerate() {
let mut w = word;
while w != 0 {
let bit = w.trailing_zeros();
values.push((word_idx as u16) * 64 + bit as u16);
w &= !(1u64 << bit);
}
}
value:good-to-have; category:bug; feedback: The Gemini AI reviewer is correct! With `u64::trailing_zeros()` the algorithm can be optimized to visit only the set bits in each word, in ascending order, which can be a big gain for sparse bitmaps.
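A self-contained sketch of the suggested technique (the function name is illustrative, not from the PR): clearing the lowest set bit after each `trailing_zeros()` call visits only the 1 bits, in ascending order:

```rust
// Extract the indices of set bits from a word using trailing_zeros(),
// visiting only the 1 bits instead of all 64 positions.
fn set_bit_indices(mut word: u64) -> Vec<u32> {
    let mut indices = Vec::new();
    while word != 0 {
        let bit = word.trailing_zeros(); // index of the lowest set bit
        indices.push(bit);
        word &= word - 1; // clear the lowest set bit
    }
    indices
}
```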
for value in arr.iter().flatten() {
    self.set_bit(value);
}
value:good-to-have; category:bug; feedback: The Gemini AI reviewer is correct! The `flatten()` operation checks whether each item in the array is `Some` or `None`. For arrays without nulls this can be avoided with the constant-time check `arr.null_count() == 0`; that single check replaces N per-element `Option` checks, where N is the size of the array.
for (word_idx, &word) in self.bitmap.iter().enumerate() {
    if word != 0 {
        for bit in 0..64 {
            if (word & (1u64 << bit)) != 0 {
                values.push(((word_idx as u16) * 64 + bit) as i16);
            }
        }
    }
}
Iterating through all 64 bits for every non-zero word is inefficient. Using trailing_zeros() to find set bits is significantly faster.
for (word_idx, &word) in self.bitmap.iter().enumerate() {
let mut w = word;
while w != 0 {
let bit = w.trailing_zeros();
values.push(((word_idx as u16) * 64 + bit as u16) as i16);
w &= !(1u64 << bit);
}
}
value:good-to-have; category:bug; feedback: The Gemini AI reviewer is correct! With `u64::trailing_zeros()` the algorithm can be optimized to visit only the set bits in each word, in ascending order, which can be a big gain for sparse bitmaps.
Cursor Bugbot has reviewed your changes and found 1 potential issue.
Reviewed by Cursor Bugbot for commit b564773.
}

fn size(&self) -> usize {
    size_of_val(self) + 256
}
Bool array accumulators double-count memory in size()
Low Severity
The size() method for BoolArray256DistinctCountAccumulator and BoolArray256DistinctCountAccumulatorI8 returns size_of_val(self) + 256, but the seen: [bool; 256] field is stored inline in the struct — not on the heap. size_of_val(self) already includes the 256-byte array, so adding 256 double-counts it, reporting ~512 bytes instead of 256. Contrast this with the Bitmap65536 variants, which correctly add 8192 for the heap-allocated Box<[u64; 1024]> that size_of_val only accounts for as a pointer. This overestimate in memory tracking could cause premature spilling or other unnecessary memory-pressure actions.
Additional Locations (1)
value:useful; category:bug; feedback: The Bugbot AI reviewer is correct! The `seen` field is a fixed-size array stored inline in the struct, so it is already included in the value calculated by `size_of_val()` and should not be counted again manually. The fix prevents overcounting the accumulator's size.
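For contrast, a hedged sketch (simplified names, not the PR's exact types) of why the Bitmap65536 variants do need the manual addition: `size_of_val(self)` only sees the `Box` pointer, not the 8 KiB it points to:

```rust
use std::mem::size_of_val;

// Simplified stand-in for the bitmap variant: the 1024 u64 words (8192 bytes)
// live on the heap behind a Box, which size_of_val(self) counts only as a pointer.
struct BitmapAccumulator {
    bitmap: Box<[u64; 1024]>,
}

impl BitmapAccumulator {
    fn size(&self) -> usize {
        // Add the heap allocation explicitly; size_of_val(&*self.bitmap) is 8192.
        size_of_val(self) + size_of_val(&*self.bitmap)
    }
}
```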
Actionable comments posted: 1
🧹 Nitpick comments (1)
datafusion/functions-aggregate-common/src/aggregate/count_distinct/native.rs (1)
275-324: Add signed round-trip coverage before this ships. The `Int8` and `Int16` variants rely on remapping through `u8`/`u16` plus custom `state()` serialization. A focused `update_batch -> state -> merge_batch -> evaluate` test around `-128, -1, 0, 127` and `i16::MIN, -1, 0, i16::MAX` would lock down the edge values most likely to regress. Also applies to: 462-511
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@datafusion/functions-aggregate-common/src/aggregate/count_distinct/native.rs` around lines 275 - 324, Add unit tests that exercise signed integer round-trip for the Int8 and Int16 distinct-count accumulators: specifically create tests that call update_batch on BoolArray256DistinctCountAccumulatorI8 (and the Int16 counterpart), then call state(), then merge_batch() with that state, and finally evaluate() to assert correct counts for edge values (-128, -1, 0, 127 for i8 and i16::MIN, -1, 0, i16::MAX for i16). Ensure tests cover both direct update_batch inputs and list-encoded states used by merge_batch so the remapping through u8/u16 and the custom state() serialization is validated end-to-end (use the functions/methods update_batch, state, merge_batch, evaluate and the accumulator types named in the diff).
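The edge values the nitpick targets hinge on the signed-to-unsigned remapping round-tripping correctly. A minimal sketch of that property (the helper names are hypothetical, not from the PR):

```rust
// Hypothetical helpers mirroring the remapping the signed accumulators rely on:
// an i8 is stored at the index of its u8 bit pattern and recovered on evaluate().
fn i8_to_index(v: i8) -> usize {
    v as u8 as usize // -128 maps to 128, -1 maps to 255
}

fn index_to_i8(i: usize) -> i8 {
    i as u8 as i8 // inverse cast recovers the original signed value
}
```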
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@datafusion/functions-aggregate/src/count.rs`:
- Around line 191-193: The DataType::UInt16 | DataType::Int16 match branch
currently forces creation of the full-bitmap accumulator via
get_small_int_accumulator(data_type).unwrap(), causing eager 8 KiB allocations
per group; change this branch to return the lazy/adaptive accumulator used for
other small-int types (i.e., the accumulator that grows with cardinality and
only materializes a bitmap when needed) instead of the fixed bitmap one so
DISTINCT keeps one growable Accumulator per live group; update the
DataType::UInt16 | DataType::Int16 arm to use the existing lazy/adaptive factory
(or the previous get_small_int_accumulator variant that is lazy) and ensure
types/unwrap usage match the surrounding code.
---
Nitpick comments:
In
`@datafusion/functions-aggregate-common/src/aggregate/count_distinct/native.rs`:
- Around line 275-324: Add unit tests that exercise signed integer round-trip
for the Int8 and Int16 distinct-count accumulators: specifically create tests
that call update_batch on BoolArray256DistinctCountAccumulatorI8 (and the Int16
counterpart), then call state(), then merge_batch() with that state, and finally
evaluate() to assert correct counts for edge values (-128, -1, 0, 127 for i8 and
i16::MIN, -1, 0, i16::MAX for i16). Ensure tests cover both direct update_batch
inputs and list-encoded states used by merge_batch so the remapping through
u8/u16 and the custom state() serialization is validated end-to-end (use the
functions/methods update_batch, state, merge_batch, evaluate and the accumulator
types named in the diff).
📒 Files selected for processing (5)
- .cursor/rules.md
- AGENTS.md
- datafusion/functions-aggregate-common/src/aggregate/count_distinct.rs
- datafusion/functions-aggregate-common/src/aggregate/count_distinct/native.rs
- datafusion/functions-aggregate/src/count.rs
DataType::UInt8 | DataType::Int8 | DataType::UInt16 | DataType::Int16 => {
    get_small_int_accumulator(data_type).unwrap()
}
Avoid routing all UInt16/Int16 groups to a fixed 8 KiB accumulator.
The UInt16/Int16 half of this branch now selects accumulators that eagerly allocate the full bitmap. DISTINCT still uses one Accumulator per live group here, so sparse groups go from “grow with cardinality” to “always 8 KiB”. On wide GROUP BY COUNT(DISTINCT ...) plans that is a sizable memory regression and can force early spill/OOM. Please keep the u16/i16 path lazy/adaptive instead of switching every case to the bitmap.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@datafusion/functions-aggregate/src/count.rs` around lines 191 - 193, The
DataType::UInt16 | DataType::Int16 match branch currently forces creation of the
full-bitmap accumulator via get_small_int_accumulator(data_type).unwrap(),
causing eager 8 KiB allocations per group; change this branch to return the
lazy/adaptive accumulator used for other small-int types (i.e., the accumulator
that grows with cardinality and only materializes a bitmap when needed) instead
of the fixed bitmap one so DISTINCT keeps one growable Accumulator per live
group; update the DataType::UInt16 | DataType::Int16 arm to use the existing
lazy/adaptive factory (or the previous get_small_int_accumulator variant that is
lazy) and ensure types/unwrap usage match the surrounding code.
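A hedged sketch of the lazy/adaptive shape the reviewer asks to preserve (the type, threshold, and method names are illustrative assumptions, not DataFusion's actual implementation): start with a small set and materialize the 8 KiB bitmap only once cardinality warrants it:

```rust
use std::collections::HashSet;

// Illustrative adaptive distinct tracker for u16: a HashSet while cardinality
// is low, upgraded to a fixed 8 KiB bitmap past a threshold. The threshold of
// 64 is an arbitrary assumption for this sketch.
enum DistinctU16 {
    Small(HashSet<u16>),
    Bitmap(Box<[u64; 1024]>),
}

impl DistinctU16 {
    fn new() -> Self {
        DistinctU16::Small(HashSet::new())
    }

    fn insert(&mut self, v: u16) {
        match self {
            DistinctU16::Small(set) => {
                set.insert(v);
                if set.len() > 64 {
                    // Materialize the bitmap only once the group is dense enough.
                    let mut bitmap = Box::new([0u64; 1024]);
                    for &x in set.iter() {
                        bitmap[(x >> 6) as usize] |= 1u64 << (x & 63);
                    }
                    *self = DistinctU16::Bitmap(bitmap);
                }
            }
            DistinctU16::Bitmap(bitmap) => {
                bitmap[(v >> 6) as usize] |= 1u64 << (v & 63);
            }
        }
    }

    fn count(&self) -> usize {
        match self {
            DistinctU16::Small(set) => set.len(),
            DistinctU16::Bitmap(bitmap) => {
                bitmap.iter().map(|w| w.count_ones() as usize).sum()
            }
        }
    }
}
```

Sparse groups stay on the cheap `HashSet` path, so memory grows with cardinality rather than being 8 KiB per live group up front.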
value:useful; category:bug; feedback: The Claude AI reviewer is correct! The
value:good-to-have; category:bug; feedback: The Claude AI reviewer is correct!
value:good-to-have; category:bug; feedback: The Claude AI reviewer is correct! By using


21456: To review by AI